Abstract

The problem Parsimony Haplotyping (PH) asks for the smallest set of haplotypes which can explain
a given set of genotypes, and the problem Minimum Perfect Phylogeny Haplotyping (MPPH) asks
for the smallest such set which also allows the haplotypes to be embedded in a perfect phylogeny, an
evolutionary tree with biologically-motivated restrictions. For PH, we extend recent work by further
mapping the interface between "easy" and "hard" instances, within the framework of $(k,\ell)$-bounded
instances where the number of 2's per column and row of the input matrix is restricted. By exploring, in
the same way, the tractability frontier of MPPH we provide the first concrete, positive results for this
problem, and the algorithms underpinning these results offer new insights about how MPPH might 
be further tackled in the future. In addition, we construct for both PH and MPPH polynomial time 
approximation algorithms, based on properties of the columns of the input matrix. We conclude with 
an overview of intriguing open problems in PH and MPPH. 

Index Terms 

Combinatorial algorithms, Biology and genetics, Complexity hierarchies 



I. Introduction 

The computational problem of inferring biologically-meaningful haplotype data from the genotype
data of a population continues to generate considerable interest at the interface of biology
and computer science/mathematics. A popular underlying abstraction for this model (in the
context of diploid organisms) represents a genotype as a string over a {0, 1, 2} alphabet, and
a haplotype as a string over {0, 1}. The exact goal depends on the biological model being
applied but a common, minimal algorithmic requirement is that, given a set of genotypes, a set
of haplotypes must be produced which resolves the genotypes.

To be precise, we are given a genotype matrix $G$ with elements in $\{0, 1, 2\}$, the rows of which
correspond to genotypes, while its columns correspond to sites on the genome, called SNPs. A
haplotype matrix has elements from $\{0, 1\}$, and rows corresponding to haplotypes. Haplotype
matrix $H$ resolves genotype matrix $G$ if for each row $g_i$ of $G$ containing at least one 2, there are
two rows $h_{i_1}$ and $h_{i_2}$ of $H$ such that $g_i(j) = h_{i_1}(j)$ for all $j$ with $h_{i_1}(j) = h_{i_2}(j)$, and
$g_i(j) = 2$ otherwise, in which case we say that $h_{i_1}$ and $h_{i_2}$ resolve $g_i$, we write
$g_i = h_{i_1} + h_{i_2}$, and we call $h_{i_1}$ the complement of $h_{i_2}$ with respect to $g_i$, and vice versa.
A row $g_i$ without 2's is itself a haplotype and is uniquely resolved by this haplotype, which thus
has to be contained in $H$.
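
The definition of "resolves" can be checked mechanically. The following is a minimal sketch, assuming genotypes and haplotypes are stored as Python tuples; the function names `combine` and `resolves` are illustrative and do not come from the paper.

```python
from itertools import combinations_with_replacement

def combine(h1, h2):
    """Genotype h1 + h2: equal entries are kept, differing entries become 2."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def resolves(H, G):
    """Return True if haplotype matrix H resolves genotype matrix G.

    Pairing a haplotype with itself covers genotypes without 2's,
    which must appear in H as rows.
    """
    rows = sorted({tuple(h) for h in H})
    producible = {combine(h1, h2)
                  for h1, h2 in combinations_with_replacement(rows, 2)}
    return all(tuple(g) in producible for g in G)
```

For instance, the genotypes (2, 0, 2) and (0, 0, 1) are resolved by the two haplotypes (0, 0, 1) and (1, 0, 0), but not by (0, 0, 0) and (1, 0, 1), because the second genotype is itself a haplotype and is missing from that set.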

We define the first of the two problems that we study in this paper. 

Problem: Parsimony Haplotyping (PH) 
Input: A genotype matrix G. 

Output: A haplotype matrix H with a minimum number of rows that resolves G. 

There is a rich literature in this area, of which recent papers such as [5] give a good overview. The
problem is APX-hard [13][17] and, in terms of approximation algorithms with performance guarantees,
existing methods remain rather unsatisfactory, as will be shortly explained. This has led
many authors to consider methods based on Integer Linear Programming (ILP) [5][10][11][13].
A different response to the hardness is to search for "islands of tractability" amongst special,
restricted cases of the problem, exploring the frontier between hardness and polynomial-time
solvability. In the literature available in this direction [6][13][14][17], this investigation has
specified classes of $(k,\ell)$-bounded instances: in a $(k,\ell)$-bounded instance the input genotype
matrix $G$ has at most $k$ 2's per row and at most $\ell$ 2's per column (cf. [17]). If $k$ or $\ell$ is a "*" we
mean instances that are bounded only by the number of 2's per column or per row, respectively.
In this paper we supplement this "tractability" literature with mainly positive results, and in
doing so almost complete the bounded instance complexity landscape.

Next to the PH problem we study the Minimum Perfect Phylogeny Haplotyping (MPPH)
model [2]. Again a minimum-size set of resolving haplotypes is required but this time under
the additional, biologically-motivated restriction that the produced haplotypes permit a perfect
phylogeny, i.e., they can be placed at the leaves of an evolutionary tree within which each
site mutates at most once. Haplotype matrices admitting a perfect phylogeny are completely
characterised [8][9] by the absence of the forbidden submatrix

$$F = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

Problem: Minimum Perfect Phylogeny Haplotyping (MPPH)
Input: A genotype matrix G. 

Output: A haplotype matrix H with a minimum number of rows that resolves G and admits a 
perfect phylogeny. 
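
The perfect phylogeny condition on a candidate solution can be tested directly. The sketch below assumes the forbidden submatrix is the standard four-gamete pattern (two columns in which all of the pairs 11, 00, 10 and 01 occur); the function name is illustrative.

```python
from itertools import combinations

def admits_perfect_phylogeny(H):
    """Four-gamete test: no column pair of H may show all of 00, 01, 10, 11."""
    if not H:
        return True
    num_columns = len(H[0])
    for a, b in combinations(range(num_columns), 2):
        pairs = {(h[a], h[b]) for h in H}
        if len(pairs) == 4:  # the forbidden submatrix occurs in columns a, b
            return False
    return True
```

This runs in time quadratic in the number of columns, which is enough for checking solutions but is not the linear-time PPH algorithm cited in the text.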

The feasibility question (PPH) - given a genotype matrix G, find any haplotype matrix H 
that resolves G and admits a perfect phylogeny, or state that no such H exists - is solvable in 
linear-time [7] [19]. Researchers in this area are now moving on to explore the PPH question 
on phylogenetic networks [18]. 

The MPPH problem, however, has so far hardly been studied beyond an NP-hardness result
[2] and occasional comments within PH and PPH literature [4][19][20]. In this paper we thus
provide what is one of the first attempts to analyse the parsimony optimisation criterion within a
well-defined and widely applicable biological framework. Specifically, we seek to map the MPPH
complexity landscape in the same way as the PH complexity landscape: using the concept of
$(k,\ell)$-boundedness. We write PH$(k,\ell)$ and MPPH$(k,\ell)$ for these problems restricted to
$(k,\ell)$-bounded instances.

Previous work and our contribution 

In [13] it was shown that PH(3,*) is APX-hard. In [6][14] it was shown that PH(2,*) is
polynomial-time solvable. Recently, in [17], it was shown (amongst other results) that PH(4,3)
is APX-hard. In [17] it was also proven that PH(*,2) is polynomial-time solvable in the restricted
subcase where the compatibility graph of the input genotype matrix is a clique. (Informally,
the compatibility graph shows for every pair of genotypes whether those two genotypes can use
common haplotypes in their resolution.)

In this paper, we bring the boundaries between hard and easy classes closer by showing that 
PH(3,3) is APX-hard and that PH(*, 1) is polynomial-time solvable. 

As far as MPPH is concerned there have been, prior to this paper, no concrete results 
beyond the above mentioned NP-hardness result. We show that MPPH(3, 3) is APX-hard and 
that, like their PH counterparts, MPPH(2, *) and MPPH(*, 1) are polynomial-time solvable 
(in both cases using a reduction to the PH counterpart). We also show that the clique result 
from [17] holds in the case of MPPH(*, 2) as well. As with its PH counterpart the complexity 
of MPPH(*,2) remains open. 

The fact that both PH and MPPH already become APX-hard for (3,3)-bounded instances
means that, in terms of deterministic approximation algorithms, the best that we can in general
hope for is constant approximation ratios. Lancia et al. [13][14] have given two separate
approximation algorithms with approximation ratios of $\sqrt{n}$ and $2^{k-1}$ respectively, where $n$ is
the number of genotypes in the input, and $k$ is the maximum number of 2's appearing in a
row of the genotype matrix.¹ An $O(\log n)$ approximation algorithm has been given in [21] but
this only runs in polynomial time if the set of all possible haplotypes that can participate in
feasible solutions can be enumerated in polynomial time. The obvious problem with the $2^{k-1}$
and the $O(\log n)$ approximation algorithms is thus that either the accuracy decays exponentially
(as in the former case) or the running time increases exponentially (as in the latter case) with an
increasing number of 2's per row. Here we offer a simple, alternative approach which achieves
(in polynomial time) approximation ratios linear in $\ell$ for PH$(*,\ell)$ and MPPH$(*,\ell)$ instances,
and actually also achieves these ratios in polynomial time when $\ell$ is not constant. These ratios are
shown in Table I; note how improved ratios can be obtained if every genotype is guaranteed
to have at least one 2.

We have thus decoupled the approximation ratio from the maximum number of 2's per row, and
instead made the ratio conditional on the maximum number of 2's per column. Our approximation
scheme is hence an improvement to the $2^{k-1}$-approximation algorithm except in cases where
the maximum number of 2's per row is exponentially small compared to the maximum number
of 2's per column. Our approximation scheme also yields the first approximation results for
MPPH.

¹ It would be overly restrictive to write PH(k,*) here because their algorithm runs in polynomial
time even if k is not a constant.

TABLE I

Approximation ratios achieved in this paper

Problem ($\ell \geq 2$)                                        Approximation ratio

PH$(*,\ell)$                                                   $\frac{3}{2}\ell + \frac{1}{2}$
PH$(*,\ell)$ where every genotype has at least one 2           $\frac{3}{4}\ell + \frac{7}{4} - \frac{3}{2}\cdot\frac{1}{\ell+1}$
MPPH$(*,\ell)$                                                 $2\ell$
MPPH$(*,\ell)$ where every genotype has at least one 2         $\ell + 2 - \frac{2}{\ell+1}$

As explained by Sharan et al. in their "islands of tractability" paper [17], identifying tractable 
special classes can be practically useful for constructing high-speed subroutines within ILP 
solvers, but perhaps the most significant aspect of this paper is the analysis underpinning the 
results, which - by deepening our understanding of how this problem behaves - assists the search 
for better, faster approximation algorithms and for determining the exact shorelines of the islands 
of tractability. 

Furthermore, the fact that - prior to this paper - concrete and positive results for MPPH had 
not been obtained (except for rather pessimistic modifications to ILP models [5]), means that 
the algorithms given here for the MPPH cases, and the data structures used in their analysis 
(e.g. the restricted compatibility graph in Section [III]), assume particular importance. 

Finally, this paper yields some interesting open problems, of which the outstanding (*,2) 
case (for both PH and MPPH) is only one; prominent amongst these questions (which are 
discussed at the end of the paper) is the question of whether MPPH and PH instances are 
inter-reducible, at least within the bounded-instance framework. 

The paper is organised as follows. In Section II we give the hardness results, in Section III we
present the polynomial-time solvable cases, in Section IV we give approximation algorithms and
we finish in Section V with conclusions and open problems.

II. Hard problems 
Theorem 1: MPPH(3,3) is APX-hard.
Proof: The proof in [2] that MPPH is NP-hard uses a reduction from Vertex Cover,
which can be modified to yield NP-hardness and APX-hardness for (3,3)-bounded instances.
Given a graph $T = (V, E)$ the reduction in [2] constructs a genotype matrix $G(T)$ of MPPH
with $|V| + |E|$ rows and $2|V| + |E|$ columns. For every vertex $v_i \in V$ there is a genotype (row)
$g_i$ in $G(T)$ with $g_i(i) = 1$, $g_i(i + |V|) = 1$ and $g_i(j) = 0$ for every other position $j$. In addition,
for every edge $e_k = \{v_h, v_l\}$ there is a genotype $g_k$ with $g_k(h) = 2$, $g_k(l) = 2$, $g_k(2|V| + k) = 2$
and $g_k(j) = 0$ for every other position $j$. Bafna et al. [2] prove that an optimal solution for
MPPH with input $G(T)$ contains $|V| + |E| + VC(T)$ haplotypes, where $VC(T)$ is the size of
the smallest vertex cover in $T$.
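
To make the construction concrete, the following sketch builds $G(T)$ from a vertex count and an edge list and reports the maximum number of 2's per row and per column. It uses 0-based vertex and edge indices (the text is 1-based), and the function names are illustrative, not taken from [2].

```python
def vertex_cover_to_genotypes(num_vertices, edges):
    """Build the genotype matrix G(T) for a graph with the given edges."""
    n, m = num_vertices, len(edges)
    width = 2 * n + m
    rows = []
    for i in range(n):                      # one genotype per vertex
        g = [0] * width
        g[i] = 1
        g[n + i] = 1
        rows.append(tuple(g))
    for k, (h, l) in enumerate(edges):      # one genotype per edge
        g = [0] * width
        g[h] = 2
        g[l] = 2
        g[2 * n + k] = 2
        rows.append(tuple(g))
    return rows

def max_twos(G):
    """Return (max number of 2's in a row, max number of 2's in a column)."""
    per_row = max(row.count(2) for row in G)
    per_col = max(col.count(2) for col in zip(*G))
    return per_row, per_col
```

Each edge genotype has exactly three 2's, and column $i$ holds one 2 per edge incident to $v_i$, so a graph of maximum degree 3 gives a (3,3)-bounded matrix.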

3-Vertex Cover is the vertex cover problem when every vertex in the input graph has
degree at most 3. It is known to be APX-hard [15][1]. Let $T$ be an instance of 3-Vertex Cover.
We assume that $T$ is connected. Observe that for such a $T$ the reduction described above yields
an MPPH instance $G(T)$ that is (3,3)-bounded. We show that existence of a polynomial-time
$(1 + \epsilon)$ approximation algorithm $A(\epsilon)$ for MPPH would imply a polynomial-time $(1 + \epsilon')$
approximation algorithm for 3-Vertex Cover with $\epsilon' = 8\epsilon$.¹

Let $t$ be the solution value for MPPH($G(T)$) returned by $A(\epsilon)$, and $t^*$ the optimal value for
MPPH($G(T)$). By the argument mentioned above from [2] we obtain a solution with value
$d = t - |V| - |E|$ as an approximation of $VC(T)$. Since $t \leq (1 + \epsilon)t^*$, we have
$d \leq VC(T) + \epsilon VC(T) + \epsilon|V| + \epsilon|E|$. Connectedness of $T$ implies that $|V| - 1 \leq |E|$. In 3-Vertex Cover,
a single vertex can cover at most 3 edges in $T$, implying that $VC(T) \geq |E|/3 \geq (|V| - 1)/3$.
Hence, $|E| \leq 3VC(T)$ and $|V| \leq 3VC(T) + 1 \leq 4VC(T)$ (for $|V| \geq 2$), and we have (if $|V| \geq 2$):

$$\begin{aligned}
d &\leq VC(T) + \epsilon VC(T) + 4\epsilon VC(T) + 3\epsilon VC(T) \\
  &\leq VC(T) + 8\epsilon VC(T) \\
  &\leq (1 + 8\epsilon)VC(T).
\end{aligned}$$

■

Theorem 2: PH(3, 3) is APX-hard. 

¹ Strictly speaking this is insufficient to prove APX-hardness but it is not difficult to show that the
described reduction is actually an L-reduction [15], from which APX-hardness follows.

Proof: The proof by Sharan et al. [17] that PH(4,3) is APX-hard can be modified slightly
to obtain APX-hardness of PH(3,3). The reduction is from 3-Dimensional Matching with
each element occurring in at most three triples (3DM3): given disjoint sets $X$, $Y$ and $Z$ containing
$\nu$ elements each and a set $C = \{c_0, \ldots, c_{\mu-1}\}$ of $\mu$ triples in $X \times Y \times Z$ such that each element
occurs in at most three triples in $C$, find a maximum cardinality set $C' \subseteq C$ of disjoint triples.

From an instance of 3DM3 we build a genotype matrix $G$ with $3\nu + 3\mu$ rows and $6\nu + 4\mu$
columns. The first $3\nu$ rows are called element-genotypes and the last $3\mu$ rows are called
matching-genotypes. We specify non-zero entries of the genotypes only.² For every element $x_i \in X$ define
element-genotype $g^x_i$ with $g^x_i(3\nu + i) = 1$; $g^x_i(6\nu + 4k) = 2$ for all $k$ with $x_i \in c_k$. If $x_i$ occurs
in at most two triples we set $g^x_i(i) = 2$. For every element $y_i \in Y$ there is an element-genotype
$g^y_i$ with $g^y_i(4\nu + i) = 1$; $g^y_i(6\nu + 4k) = 2$ for all $k$ with $y_i \in c_k$, and if $y_i$ occurs in at most two
triples then we set $g^y_i(\nu + i) = 2$. For every element $z_i \in Z$ there is an element-genotype $g^z_i$ with
$g^z_i(5\nu + i) = 1$; $g^z_i(6\nu + 4k) = 2$ for all $k$ with $z_i \in c_k$, and if $z_i$ occurs in at most two triples
then we set $g^z_i(2\nu + i) = 2$. For each triple $c_k = \{x_{i_1}, y_{i_2}, z_{i_3}\} \in C$ there are three
matching-genotypes $c^x_k$, $c^y_k$ and $c^z_k$: $c^x_k$ has $c^x_k(3\nu + i_1) = 2$, $c^x_k(6\nu + 4k) = 1$ and $c^x_k(6\nu + 4k + 1) = 2$;
$c^y_k$ has $c^y_k(4\nu + i_2) = 2$, $c^y_k(6\nu + 4k) = 1$ and $c^y_k(6\nu + 4k + 2) = 2$; $c^z_k$ has $c^z_k(5\nu + i_3) = 2$,
$c^z_k(6\nu + 4k) = 1$ and $c^z_k(6\nu + 4k + 3) = 2$.

Notice that the element-genotypes only have a 2 in the first $3\nu$ columns if the element occurs
in at most two triples. This is the only difference with the reduction from [17], where every
element-genotype has a 2 in the first $3\nu$ columns: i.e., for elements $x_i \in X$, $y_i \in Y$ or $z_i \in Z$ a
2 in column $i$, $\nu + i$ or $2\nu + i$, respectively. As a direct consequence our genotype matrix has
at most three 2's per row in contrast to the four 2's per row in the original reduction.
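
The modified reduction can be written out as follows. This is a sketch only: it assumes a triple is given as a tuple of 0-based indices (i1, i2, i3) into X, Y and Z, and the function name is hypothetical.

```python
def build_3dm3_matrix(nu, triples):
    """Genotype matrix of the modified reduction (0-based rows and columns)."""
    mu = len(triples)
    width = 6 * nu + 4 * mu
    rows = []
    # Element-genotypes: axis 0, 1, 2 stands for X, Y, Z.
    for axis in range(3):
        for i in range(nu):
            g = [0] * width
            g[(3 + axis) * nu + i] = 1
            ks = [k for k, t in enumerate(triples) if t[axis] == i]
            for k in ks:
                g[6 * nu + 4 * k] = 2
            if len(ks) <= 2:
                # The left-most 2 is kept only for elements
                # occurring in at most two triples.
                g[axis * nu + i] = 2
            rows.append(tuple(g))
    # Matching-genotypes: three per triple.
    for k, t in enumerate(triples):
        for axis in range(3):
            c = [0] * width
            c[(3 + axis) * nu + t[axis]] = 2
            c[6 * nu + 4 * k] = 1
            c[6 * nu + 4 * k + 1 + axis] = 2
            rows.append(tuple(c))
    return rows
```

Counting 2's per row and per column of the result on small inputs is a quick way to confirm that the instance is (3,3)-bounded when every element occurs in at most three triples.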

We claim that for this (3,3)-bounded instance exactly the same arguments can be used as for the
(4,3)-bounded instance. In the original reduction the left-most 2's ensured that, for each
element-genotype, at most one of the two haplotypes used to resolve it was used in the resolution of other
genotypes. Clearly this remains true in our modified reduction for elements appearing in two or
fewer triples, because the corresponding left-most 2's have been retained. So consider an element
$x_i$ appearing in three triples and suppose, by way of contradiction, that both haplotypes used to
resolve $g^x_i$ are used in the resolution of other genotypes. Now, the 1 in position $3\nu + i$ prevents
this element-genotype from sharing haplotypes with other element-genotypes, so genotype $g^x_i$
must share both its haplotypes with matching-genotypes. Note that, because $g^x_i(3\nu + i) = 1$,
the genotype $g^x_i$ can only possibly share haplotypes with matching-genotypes corresponding to
triples that contain $x_i$. Indeed, if $x_i$ is in triples $c_{k_1}$, $c_{k_2}$ and $c_{k_3}$ then the only genotypes with
which $g^x_i$ can potentially share haplotypes are $c^x_{k_1}$, $c^x_{k_2}$ and $c^x_{k_3}$. Genotype $g^x_i$ cannot share both its
haplotypes with the same matching-genotype (e.g. $c^x_{k_1}$) because both haplotypes of $g^x_i$ will have
a 1 in column $3\nu + i$ whilst only one of the two haplotypes for $c^x_{k_1}$ will have a 1 in that column.
So, without loss of generality, $g^x_i$ is resolved by a haplotype that $c^x_{k_1}$ uses and a haplotype that
$c^x_{k_2}$ uses. However, this is not possible, because $g^x_i$ has a 2 in the column corresponding to $c_{k_3}$,
whilst both $c^x_{k_1}$ and $c^x_{k_2}$ have a 0 in that column, yielding a contradiction.

² Only in this proof we index haplotypes, genotypes and matrices starting with 0, which makes notation consistent with [17].

Note that, in the original reduction, it was not only true that each element-genotype shared at
most one of its haplotypes, but, more strongly, it was also true that such a shared haplotype
was used by exactly one other genotype (i.e. the genotype corresponding to the triple the element
gets assigned to). To see that this property is also retained in the modified reduction observe
that if (say) $g^x_i$ shares one haplotype with two genotypes $c^x_{k_1}$ and $c^x_{k_2}$ then $x_i$ must be in both
triples $c_{k_1}$ and $c_{k_2}$, but this is not possible because, in the two columns corresponding to triples
$c_{k_1}$ and $c_{k_2}$, $c^x_{k_1}$ has 1 and 0 whilst $c^x_{k_2}$ has 0 and 1.



III. Polynomial-time solvability 

A. Parsimony haplotyping 

We will prove polynomial-time solvability of PH on (*,l)-bounded instances. 

We say that two genotypes $g_1$ and $g_2$ are compatible, denoted as $g_1 \sim g_2$, if $g_1(j) = g_2(j)$
or $g_1(j) = 2$ or $g_2(j) = 2$ for all $j$. A genotype $g$ and a haplotype $h$ are consistent if $h$ can be
used to resolve $g$, i.e. if $g(j) = h(j)$ or $g(j) = 2$ for all $j$. The compatibility graph is the graph
with vertices for the genotypes and an edge between two genotypes if they are compatible.
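
These three definitions translate almost literally into code. The sketch below assumes genotypes and haplotypes are tuples and represents the compatibility graph as a set of index pairs; the representation and the function names are illustrative choices.

```python
from itertools import combinations

def compatible(g1, g2):
    """Genotypes agree in every column where neither of them has a 2."""
    return all(a == b or a == 2 or b == 2 for a, b in zip(g1, g2))

def consistent(g, h):
    """Haplotype h can be used to resolve genotype g."""
    return all(a == b or a == 2 for a, b in zip(g, h))

def compatibility_graph(G):
    """Edges {(i, j) : i < j and rows i, j of G are compatible}."""
    return {(i, j) for i, j in combinations(range(len(G)), 2)
            if compatible(G[i], G[j])}
```

Building the graph this way compares every pair of rows column by column, matching the $O(n^2 m)$ construction time stated in the proof of Theorem 3.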

Lemma 1: If $g_1$ and $g_2$ are compatible rows of a genotype matrix with at most one 2 per
column then there exists exactly one haplotype that is consistent with both $g_1$ and $g_2$.

Proof: The only haplotype that is consistent with both $g_1$ and $g_2$ is $h$ with $h(j) = g_1(j)$
for all $j$ with $g_1(j) \neq 2$ and $h(j) = g_2(j)$ for all $j$ with $g_2(j) \neq 2$. There are no columns where
$g_1$ and $g_2$ are both equal to 2 because there is at most one 2 per column. In columns where $g_1$
and $g_2$ are both not equal to 2 they are equal because $g_1$ and $g_2$ are compatible.

■

Fig. 1. Example of a genotype matrix and the corresponding compatibility graph, with $h_1 = (0,0,1,1,0,0,1)$,
$h_2 = (0,0,1,0,0,0,1)$ and $h_3 = (1,0,0,0,0,0,1)$.
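
The proof of Lemma 1 is constructive, and the haplotype it describes can be computed column by column. The sketch below uses an illustrative function name and raises an error when the lemma's hypotheses fail.

```python
def common_haplotype(g1, g2):
    """Unique haplotype consistent with two compatible genotypes (Lemma 1)."""
    h = []
    for a, b in zip(g1, g2):
        if a == 2 and b == 2:
            raise ValueError("a column holds two 2's; Lemma 1 does not apply")
        if a != 2 and b != 2 and a != b:
            raise ValueError("genotypes are not compatible")
        # Take the entry of whichever genotype does not have a 2 here.
        h.append(b if a == 2 else a)
    return tuple(h)
```

For example, the genotypes (2, 1, 0) and (1, 1, 2) share exactly the haplotype (1, 1, 0).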

We use the notation $g_1 \sim_h g_2$ if $g_1$ and $g_2$ are compatible and $h$ is consistent with both. We
prove that the compatibility graph has a specific structure. A 1-sum of two graphs is the result
of identifying a vertex of one graph with a vertex of the other graph. A 1-sum of $n + 1$ graphs is
the result of identifying a vertex of a graph with a vertex of a 1-sum of $n$ graphs. See Figure 1
for an example of a 1-sum of three cliques ($K_3$, $K_4$ and $K_2$).

Lemma 2: If G is a genotype matrix with at most one 2 per column then every connected 
component of the compatibility graph of G is a 1-sum of cliques, where edges in the same clique 
are labelled with the same haplotype. 

Proof: Let $C$ be the compatibility graph of $G$ and let $g_1, g_2, \ldots, g_k$ be a cycle in $C$. It suffices
to show that there exists a haplotype $h_c$ such that $g_i \sim_{h_c} g_{i'}$ for all $i, i' \in \{1, \ldots, k\}$. Consider
an arbitrary column $j$. If there is no genotype with a 2 in this column then $g_1 \sim g_2 \sim \ldots \sim g_k$
implies that $g_1(j) = g_2(j) = \ldots = g_k(j)$. Otherwise, let $g_{i_j}$ be the unique genotype with a 2 in
column $j$. Then $g_1 \sim g_2 \sim \ldots \sim g_{i_j - 1}$ together with $g_1 \sim g_k \sim g_{k-1} \sim \ldots \sim g_{i_j + 1}$ implies that
$g_i(j) = g_{i'}(j)$ for all $i, i' \in \{1, \ldots, k\} \setminus \{i_j\}$. Set $h_c(j) = g_i(j)$, $i \neq i_j$. Repeating this for each
column $j$ produces a haplotype $h_c$ such that indeed $g_i \sim_{h_c} g_{i'}$ for all $i, i' \in \{1, \ldots, k\}$.

■ 

From this lemma, it follows directly that in PH(*, 1) the compatibility graph is chordal,
meaning that all its induced cycles are triangles. Every chordal graph has a simplicial vertex,
a vertex whose (closed) neighbourhood is a clique. Deleting a vertex in a chordal graph gives
again a chordal graph (see for example [3] for an introduction to chordal graphs). The following
lemma leads almost immediately to polynomial solvability of PH(*, 1). We use set-operations
for the rows of matrices: thus, e.g., $h \in H$ says $h$ is a row of matrix $H$, $H \cup h$ says $h$ is added
to $H$ as a row, and $H' \subseteq H$ says $H'$ is a submatrix consisting of rows of $H$.

Lemma 3: Given haplotype matrix H' and genotype matrix G with at most one 2 per column 
it is possible to find, in polynomial time, a haplotype matrix H that resolves G, has H' as a 
submatrix and has a minimum number of rows. 

Proof: The proof is constructive. Let problem $(G, H')$ denote the above problem on input
matrices $G$ and $H'$. Let $C$ be the compatibility graph of $G$, which by Lemma 2 is
chordal. Suppose $g$ corresponds to a simplicial vertex of $C$. Let $h_c$ be the unique haplotype
consistent with any genotype in the closed neighbourhood clique of $g$. We extend matrix $H'$ to
$H''$ and update graph $C$ as follows.

1) If $g$ has no 2's it can be resolved with only one haplotype $h = g$. We set $H'' = H' \cup h$
and remove $g$ from $C$.

2) Else, if there exist rows $h_1 \in H'$ and $h_2 \in H'$ that resolve $g$ we set $H'' = H'$ and remove
$g$ from $C$.

3) Else, if there exists $h_1 \in H'$ such that $g = h_1 + h_c$ we set $H'' = H' \cup h_c$ and remove $g$
from $C$.

4) Else, if there exists $h_1 \in H'$ and $h_2 \notin H'$ such that $g = h_1 + h_2$ we set $H'' = H' \cup h_2$ and
remove $g$ from $C$.

5) Else, if $g$ is not an isolated vertex in $C$ then there exists a haplotype $h_1$ such that $g = h_1 + h_c$
and we set $H'' = H' \cup \{h_1, h_c\}$ and remove $g$ from $C$.

6) Otherwise, $g$ is an isolated vertex in $C$ and we set $H'' = H' \cup \{h_1, h_2\}$ for any $h_1$ and $h_2$
such that $g = h_1 + h_2$ and remove $g$ from $C$.

The resulting graph is again chordal and we repeat the above procedure for $H' = H''$ until
all vertices are removed from $C$. Let $H$ be the final haplotype matrix $H''$. It is clear from the
construction that $H$ resolves $G$.
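
The six cases above can be sketched in code as follows. This is an illustrative, unoptimised rendering under several assumptions: genotypes are tuples with at most one 2 per column, simplicial vertices are found by brute force rather than with the ordering algorithm of [16], ties in case 4 are broken by sorted order, and all function names are hypothetical.

```python
def compatible(g1, g2):
    return all(a == b or 2 in (a, b) for a, b in zip(g1, g2))

def consistent(g, h):
    return all(a == b or a == 2 for a, b in zip(g, h))

def complement(g, h):
    """Partner of h in a resolution of g (h must be consistent with g)."""
    return tuple(1 - b if a == 2 else a for a, b in zip(g, h))

def common_haplotype(g1, g2):
    """Lemma 1; valid because no column holds two 2's."""
    return tuple(b if a == 2 else a for a, b in zip(g1, g2))

def solve_ph_star_1(G, H_prime=()):
    H = {tuple(h) for h in H_prime}
    remaining = list(dict.fromkeys(tuple(g) for g in G))  # distinct rows
    while remaining:
        for g in remaining:                               # find a simplicial vertex
            nbrs = [u for u in remaining if u != g and compatible(g, u)]
            if all(compatible(u, v) for u in nbrs for v in nbrs):
                break
        else:
            raise ValueError("no simplicial vertex: input is not (*,1)-bounded")
        h_c = common_haplotype(g, nbrs[0]) if nbrs else None
        usable = sorted(h for h in H if consistent(g, h))
        if 2 not in g:
            H.add(g)                                      # case 1
        elif any(complement(g, h) in H for h in usable):
            pass                                          # case 2
        elif h_c is not None and complement(g, h_c) in H:
            H.add(h_c)                                    # case 3
        elif usable:
            H.add(complement(g, usable[0]))               # case 4
        elif h_c is not None:
            H.update({h_c, complement(g, h_c)})           # case 5
        else:
            h1 = tuple(0 if a == 2 else a for a in g)
            H.update({h1, complement(g, h1)})             # case 6
        remaining.remove(g)
    return H
```

On the three pairwise compatible genotypes (2,0,0), (0,2,0) and (0,0,2) the sketch returns the four haplotypes (0,0,0), (1,0,0), (0,1,0) and (0,0,1), with the shared haplotype (0,0,0) playing the role of $h_c$.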

We prove that $H$ has a minimum number of rows by induction on the number of genotypes.
Clearly, if $G$ has only one genotype the algorithm constructs the only, and hence optimal, solution.
The induction hypothesis is that the algorithm finds an optimal solution to the problem $(G, H')$
for any haplotype matrix $H'$ if $G$ has at most $n - 1$ rows. Now consider haplotype matrix $H'$ and
genotype matrix $G$ with $n$ rows. The first step of the algorithm selects a simplicial vertex $g$ and
proceeds with one of the cases 1 to 6. The algorithm then finds (by the induction hypothesis)
an optimal solution $H$ to problem $(G \setminus \{g\}, H'')$. It remains to prove that $H$ is also an optimal
solution to problem $(G, H')$. We do this by showing that an optimal solution $H^*$ to problem
$(G, H')$ can be modified to include $H''$. We prove this for every case of the algorithm separately.

1) In this case $h \in H^*$, since $g$ can only be resolved by $h$.

2) In this case $H'' = H'$ and hence $H'' \subseteq H^*$.

3) Suppose that $h_c \notin H^*$. Because we are not in case 2 we know that there are two rows
in $H^*$ that resolve $g$ and at least one of the two, say $h^*$, is not a row of $H'$. Since $h_c$ is
the unique haplotype consistent with (the simplicial) $g$ and any compatible genotype, $h^*$
cannot be consistent with any other genotype than $g$. Thus, replacing $h^*$ by $h_c$ gives a
solution with the same number of rows but containing $h_c$.

4) Suppose that $h_2 \notin H^*$. Because we are not in case 2 or 3 we know that there is a haplotype
$h^* \in H^*$ consistent with $g$, $h^* \notin H'$ and $h^* \neq h_c$. Hence it is not consistent with any other
genotypes than $g$ and we can replace $h^*$ by $h_2$.

5) Suppose that $h_1 \notin H^*$ or $h_c \notin H^*$. Because we are not in case 2, 3 or 4, there are
haplotypes $h^* \in H^* \setminus H'$ and $h^{**} \in H^* \setminus H'$ that resolve $g$. If $h^*$ and $h^{**}$ are both not equal
to $h_c$ then they are not consistent with any other genotype than $g$. Replacing $h^*$ and $h^{**}$
by $h_1$ and $h_c$ leads to another optimal solution. If one of $h^*$ and $h^{**}$ is equal to $h_c$ then
we can replace the other one by $h_1$.

6) Suppose that $h_1 \notin H^*$ or $h_2 \notin H^*$. There are haplotypes $h^*, h^{**} \in H^* \setminus H'$ that resolve
$g$ and just $g$ since $g$ is an isolated vertex. Replacing $h^*$ and $h^{**}$ by $h_1$ and $h_2$ gives an
optimal solution containing $h_1$ and $h_2$.

■

Theorem 3: The problem PH(*, 1) can be solved in polynomial time. 
Proof: The proof follows from Lemma 3. Construction of the compatibility graph takes
$O(n^2 m)$ time, for an $n \times m$ input matrix. Finding an ordering in which to delete the simplicial
vertices can be done in time $O(n^2)$ [16] and resolving each vertex takes $O(n^2 m)$ time. The overall
running time of the algorithm is therefore $O(n^3 m)$.



B. Minimum perfect phylogeny haplotyping 

Polynomial-time solvability of PH on (2, *)-bounded instances has been shown in [6] and [14]. 
We prove it for MPPH(2, *). We start with a definition. 

Definition 1: For two columns of a genotype matrix we say that a reduced resolution of these
columns is the result of applying the following rules as often as possible to the submatrix induced
by these columns: […], for $a \in \{0, 1\}$.



Note that two columns can have more than one reduced resolution if there is a genotype with 
a 2 in both these columns. The reduced resolutions of a column pair of a genotype matrix G 
are submatrices of (or equal to) F and represent all possibilities for the submatrix induced by 
the corresponding two columns of a minimal haplotype matrix H resolving G, after collapsing 
identical rows. 

Theorem 4: The problem MPPH(2, *) can be solved in polynomial time. 
Proof: We reduce MPPH(2, *) to PH(2,*), which can be solved in polynomial time (see 
above). Let G be an instance of MPPH(2, *). We may assume that any two rows are different. 

Take the submatrix of any two columns of $G$. If it does not contain a [2 2] row, then in terms
of Definition 1 there is only one reduced resolution. If $G$ contains two or more [2 2] rows then,
since by assumption all genotypes are different, $G$ must have

$$\begin{bmatrix} 2 & 2 & 0 \\ 2 & 2 & 1 \end{bmatrix} \quad \text{and therefore} \quad \begin{bmatrix} 2 & 0 \\ 2 & 1 \end{bmatrix}$$

as a submatrix, which can only be resolved by a haplotype matrix containing the forbidden
submatrix $F$. It follows that in this case the instance is infeasible. If it contains exactly one [2 2]
row, then there are clearly two reduced resolutions. Thus we may assume that for each column
pair there are at most two reduced resolutions.

Observe that if for some column pair all reduced resolutions are equal to $F$ the instance is
again infeasible. On the other hand, if for all column pairs none of the reduced resolutions is
equal to $F$ then MPPH(2, *) is equivalent to PH(2, *) because any minimal haplotype matrix
$H$ that resolves $G$ admits a perfect phylogeny. Finally, consider a column pair with two reduced
resolutions, one of them containing $F$. Because there are two reduced resolutions there is a
genotype $g$ with a 2 in both columns. Let $h_1$ and $h_2$ be the haplotypes that correspond to the
resolution of $g$ that does not lead to $F$. Then we replace $g$ in $G$ by $h_1$ and $h_2$, ensuring that a
minimal haplotype matrix $H$ resolving $G$ cannot have $F$ as a submatrix in these two columns.

Repeating this procedure for every column pair either tells us that the matrix G was an 
infeasible instance or creates a genotype matrix G' such that any minimal haplotype matrix H 
resolves G' if and only if H resolves G, and H admits a perfect phylogeny. 

■ 

Theorem 5: The problem MPPH(*, 1) can be solved in polynomial time. 
Proof: Similar to the proof of Theorem 4 we reduce MPPH(*, 1) to PH(*, 1). As there,
consider for any pair of columns of the input genotype matrix $G$ its reduced resolutions, according
to Definition 1. Since $G$ has at most one 2 per column there is at most one genotype with 2's
in both columns. Hence there are at most two reduced resolutions. If all reduced resolutions are
equal to the forbidden submatrix $F$ the instance is infeasible. If on the other hand for all column
pairs no reduced resolution is equal to $F$ then in fact MPPH(*, 1) is equivalent to PH(*, 1),
because any minimal haplotype matrix resolving $G$ admits a perfect phylogeny.

As in the proof of Theorem 4 we are left with considering column pairs for which one of the
two reduced resolutions is equal to $F$. For such a column pair there must be a genotype $g$ that
has 2's in both these columns. The other genotypes have only 0's and 1's in them. Suppose we
get a forbidden submatrix $F$ in these columns of the solution if $g$ is resolved by haplotypes $h_1$
and $h_2$, where $h_1$ has $a$ and $b$ and therefore $h_2$ has $1 - a$ and $1 - b$ in these columns, $a, b \in \{0, 1\}$.
We will change the input matrix $G$ such that if $g$ gets resolved by such a forbidden resolution
these haplotypes are not consistent with any other genotypes. We do this by adding an extra
column to $G$ as follows. The genotype $g$ gets a 1 in this new column. Every genotype with $a$
and $b$ or with $1 - a$ and $1 - b$ in the considered columns gets a 0 in the new column. Every
other genotype gets a 1 in the new column. For example, the matrix

$$\begin{bmatrix} 2 & 2 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix} \quad \text{gets one extra column and becomes} \quad \begin{bmatrix} 2 & 2 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}.$$

Denote by $G_{mod}$ the result of modifying $G$ by adding such a column for every pair of columns
with exactly one 'bad' and one 'good' reduced resolution. It is not hard to see that any optimal
solution to PH(*, 1) on $G_{mod}$ can be transformed into a solution to MPPH(*, 1) on $G$ of the
same cardinality (indeed, any two haplotypes used in a forbidden resolution of a genotype $g$
in $G_{mod}$ are not consistent with any other genotype of $G_{mod}$, and hence may be replaced by
two other haplotypes resolving $g$ in a non-forbidden way). Now, let $H$ be an optimal solution
to MPPH(*, 1) on $G$. We can modify $H$ to obtain a solution to PH(*, 1) on $G_{mod}$ of the
same cardinality as follows. We modify every haplotype in $H$ in the same way as the genotypes
it resolves. From the construction of $G_{mod}$ it follows that two compatible genotypes are only
modified differently if the haplotype they are both consistent with is in a forbidden resolution.
However, in $H$ no genotypes are resolved with a forbidden resolution since $H$ is a solution to
MPPH(*, 1). We conclude that optimal solutions to PH(*, 1) on $G_{mod}$ correspond to optimal
solutions to MPPH(*, 1) on $G$ and hence the latter problem can be solved in polynomial time,
by Theorem 3.

If we use the algorithm from the proof of Lemma 3 as a subroutine we get an overall running
time of $O(n^3 m^2)$, for an $n \times m$ input matrix.

■ 
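
The extra-column step of this reduction can be sketched for a single column pair as follows. The sketch assumes at most one 2 per column, so that apart from the genotype with 2's in both columns all entries in the pair are 0 or 1; it takes the forbidden submatrix to be the four-gamete pattern, and the function name `add_guard_column` is hypothetical.

```python
def add_guard_column(G, c1, c2):
    """Append the extra column for column pair (c1, c2), if one is needed."""
    G = [tuple(g) for g in G]
    both = [g for g in G if g[c1] == 2 and g[c2] == 2]
    if len(both) != 1:
        return G                       # only one reduced resolution
    others = {(g[c1], g[c2]) for g in G if g is not both[0]}
    all_four = {(0, 0), (0, 1), (1, 0), (1, 1)}
    # A resolution of the [2 2] row is 'bad' if it completes all four pairs.
    bad = [res for res in ({(0, 0), (1, 1)}, {(0, 1), (1, 0)})
           if others | res == all_four]
    if len(bad) != 1:
        return G                       # no bad resolution, or infeasible pair
    forbidden = bad[0]
    # Genotypes matching a forbidden haplotype pair get 0, all others get 1.
    return [g + ((0,) if (g[c1], g[c2]) in forbidden else (1,)) for g in G]
```

Applied to the rows [2 2], [0 1], [1 0], [1 1], the bad resolution is 00 + 11, so only the row [1 1] receives a 0 in the new column.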

The borderline open complexity problems are now PH(*, 2) and MPPH(*, 2). Unfortunately, 
we have not found the answer to these complexity questions. However, the borders have been 
pushed slightly further. In [17] PH(*,2) is shown to be polynomially solvable if the input
genotypes have the complete graph as compatibility graph; we call this problem PH(*,2)-C1.
We will give the counterpart result for MPPH(*,2)-C1.

Let G be an $n \times m$ MPPH(*, 2)-C1 input matrix. Since the compatibility graph is a clique,
every column of G contains only one symbol besides possible 2's. If we replace in every 1-column
of G (a column containing only 1's and 2's) the 1's by 0's and mark the SNP corresponding to
this column 'flipped', then we obtain an equivalent problem on a {0, 2}-matrix G'. To see that this
problem is indeed equivalent, suppose H' is a haplotype matrix resolving this modified genotype
matrix G' and suppose H' does not contain the forbidden submatrix F. Then by interchanging
0's and 1's in every column of H' corresponding to a flipped SNP, one obtains a haplotype
matrix H without the forbidden submatrix which resolves the original input matrix G. And vice
versa. Hence, from now on we will assume, without loss of generality, that the input matrix G
is a {0, 2}-matrix.
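
The flipping argument above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `normalise_clique_instance` and `unflip` are hypothetical, matrices are assumed to be lists of rows over {0, 1, 2}, and the input is assumed to have a clique as compatibility graph (so no column contains both a 0 and a 1).

```python
def normalise_clique_instance(G):
    """Replace the 1's in every 1-column by 0's and record which columns were flipped.

    Assumes no column of G contains both a 0 and a 1 (clique instance).
    """
    m = len(G[0]) if G else 0
    flipped = [any(row[j] == 1 for row in G) for j in range(m)]
    G0 = [[0 if (flipped[j] and v == 1) else v for j, v in enumerate(row)]
          for row in G]
    return G0, flipped


def unflip(H, flipped):
    """Interchange 0's and 1's in every flipped column of a haplotype matrix."""
    return [[1 - v if flipped[j] else v for j, v in enumerate(row)] for row in H]
```

A haplotype matrix found for the normalised instance can be passed through `unflip` to obtain one for the original instance, mirroring the equivalence argued in the text.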

If we assume moreover that n > 3, which we do from here on, the trivial haplotype $h_t$ defined
as the all-0 haplotype of length m is the only haplotype consistent with all genotypes in G.

We define the restricted compatibility graph $C_R(G)$ of G as follows. As in the normal
compatibility graph, the vertices of $C_R(G)$ are the genotypes of G. However, there is an edge
$\{g, g'\}$ in $C_R(G)$ only if $g \sim_h g'$ for some $h \neq h_t$, or, equivalently, if there is a column where
both g and g' have a 2.
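
As a minimal sketch of this definition, the edges of the restricted compatibility graph can be computed directly from the "common 2" criterion. The function name `restricted_compat_edges` and the row-index representation of genotypes are assumptions made for illustration.

```python
from itertools import combinations


def restricted_compat_edges(G):
    """Return the edge set of C_R(G) for a {0, 2}-matrix G (list of rows).

    Two genotypes (row indices a < b) are adjacent iff some column
    has a 2 in both rows.
    """
    edges = set()
    for a, b in combinations(range(len(G)), 2):
        if any(x == 2 and y == 2 for x, y in zip(G[a], G[b])):
            edges.add((a, b))
    return edges
```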

Lemma 4: If G is a feasible instance of MPPH(*, 2)-C1 then every vertex in $C_R(G)$ has
degree at most 2.

Proof: Any vertex of degree higher than 2 in $C_R(G)$ implies the existence in G of the submatrix

$$B = \begin{pmatrix} 2 & 2 & 2 \\ 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$

It is easy to verify that no resolution of this submatrix permits a perfect phylogeny.

■ 
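
The "easy to verify" claim in the proof of Lemma 4 can be checked by brute force. The sketch below enumerates every resolution of the four genotypes with rows (2,2,2), (2,0,0), (0,2,0), (0,0,2) and tests each resulting haplotype set. It assumes that the forbidden submatrix F corresponds to the four-gamete condition (some pair of columns showing all of 00, 01, 10, 11); the helper names are hypothetical.

```python
from itertools import combinations, product


def resolutions(g):
    """All unordered haplotype pairs that resolve genotype g."""
    twos = [j for j, v in enumerate(g) if v == 2]
    seen, out = set(), []
    for bits in product((0, 1), repeat=len(twos)):
        h1, h2 = list(g), list(g)
        for j, b in zip(twos, bits):
            h1[j], h2[j] = b, 1 - b
        pair = frozenset((tuple(h1), tuple(h2)))
        if pair not in seen:
            seen.add(pair)
            out.append(tuple(pair))
    return out


def has_four_gametes(H):
    """True if some pair of columns of H contains 00, 01, 10 and 11."""
    H = list(H)
    m = len(H[0])
    full = {(0, 0), (0, 1), (1, 0), (1, 1)}
    return any({(h[i], h[j]) for h in H} == full
               for i, j in combinations(range(m), 2))


def admits_pp_resolution(G):
    """True if some choice of resolutions avoids the four-gamete pattern."""
    for choice in product(*(resolutions(g) for g in G)):
        H = {h for pair in choice for h in pair}
        if not has_four_gametes(H):
            return True
    return False


B = [(2, 2, 2), (2, 0, 0), (0, 2, 0), (0, 0, 2)]
```

Calling `admits_pp_resolution(B)` exercises all four resolutions of the first row against the forced resolutions of the other three rows.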

Suppose that G has two identical columns. There are either 0, 1 or 2 rows with 2's in both 
these columns. In each case it is easy to see that any haplotype matrix H resolving G can be 
modified, without introducing a forbidden submatrix, to make the corresponding columns in H 
equal as well (simply delete one column and duplicate another). This leads to the first step of 
the algorithm A that we propose for solving MPPH(*,2)-C1: 

Step 1 of A: Collapse all identical columns in G.

From now on, we assume that there are no identical columns. Let us partition the genotypes
into $G_0$, $G_1$ and $G_2$, denoting the sets of genotypes in G with, respectively, degree 0, 1, and 2 in
$C_R(G)$. For any genotype g of degree 1 in $C_R(G)$ there is exactly one genotype with a 2 in
the same column as g. Because there are no identical columns, it follows that any genotype g
of degree 1 in $C_R(G)$ can have at most two 2's. Similarly any genotype of degree 2 in $C_R(G)$
has at most three 2's. Accordingly we define $G_1^1$ and $G_1^2$ as the genotypes in $G_1$ that have one
2 and two 2's, respectively, and similarly $G_2^2$ and $G_2^3$ as the genotypes in $G_2$ with two and three
2's, respectively.
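
The partition just described can be sketched as follows. The function name `classify` is hypothetical; it returns, for each row index, the pair (degree in $C_R(G)$, number of 2's), so that a value (d, k) with d in {1, 2} corresponds to membership of $G_d^k$. The input is assumed to be a {0, 2}-matrix without identical columns.

```python
def classify(G):
    """Map each row index of G to (degree in C_R(G), number of 2's in the row)."""
    n = len(G)
    classes = {}
    for a in range(n):
        neighbours = {
            b for b in range(n)
            if b != a and any(x == 2 and y == 2 for x, y in zip(G[a], G[b]))
        }
        twos = sum(1 for v in G[a] if v == 2)
        classes[a] = (len(neighbours), twos)
    return classes
```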

The following lemma states how genotypes in these sets must be resolved if no submatrix F
is allowed in the solution. If genotype g has k 2's we denote by $g[a_1, a_2, \ldots, a_k]$ the haplotype
with entry $a_i$ in the position where g has its i-th 2 and 0 everywhere else.
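
For concreteness, the bracket notation can be written as a small helper. The name `bracket` is hypothetical, and the sketch assumes g is a row of a {0, 2}-matrix, so that every non-2 position is 0.

```python
def bracket(g, a):
    """Return g[a_1, ..., a_k]: a_i at the i-th 2 of g, and 0 elsewhere."""
    twos = [j for j, v in enumerate(g) if v == 2]
    if len(a) != len(twos):
        raise ValueError("need exactly one entry per 2 in g")
    h = [0] * len(g)
    for j, bit in zip(twos, a):
        h[j] = bit
    return tuple(h)
```

With this helper, `bracket(g, (0,) * k)` is the trivial haplotype for any g with k 2's.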

Lemma 5: A haplotype matrix is a feasible solution to the problem MPPH(*, 2)-C1 if and
only if all genotypes are resolved in one of the following ways:

(i) A genotype $g \in G_1^1$ is resolved by $g[1]$ and $g[0] = h_t$.

(ii) A genotype $g \in G_2^2$ is resolved by $g[0, 1]$ and $g[1, 0]$.

(iii) A genotype $g \in G_1^2$ is either resolved by $g[0, 0] = h_t$ and $g[1, 1]$ or by $g[0, 1]$ and $g[1, 0]$.

(iv) A genotype $g \in G_2^3$ is either resolved by $g[1, 0, 0]$ and $g[0, 1, 1]$ or by $g[0, 1, 0]$ and $g[1, 0, 1]$
(assuming that the two neighbours of g have a 2 in the first two positions where g has a 2).

Proof: A genotype $g \in G_2^2$ has degree 2 in $C_R(G)$, which implies the existence in G of a
submatrix

$$D = \begin{array}{c} g \\ g' \\ g'' \end{array} \begin{pmatrix} 2 & 2 \\ 2 & 0 \\ 0 & 2 \end{pmatrix}.$$

Resolving g with $g[0, 0]$ and $g[1, 1]$ clearly leads to the forbidden submatrix F. Similarly,
resolving a genotype $g \in G_2^3$ with $g[0, 0, 1]$ and $g[1, 1, 0]$ or with $g[0, 0, 0]$ and $g[1, 1, 1]$ leads to
a forbidden submatrix in the first two columns where g has a 2. It follows that resolving the
genotypes in a way other than described in the lemma yields a haplotype matrix which does not
admit a perfect phylogeny.

Now suppose that all genotypes are resolved as described in the lemma and assume that there
is a forbidden submatrix F in the solution. Without loss of generality, we assume F can be found
in the first two columns of the solution matrix. We may also assume that no haplotype can be
deleted from the solution. Then, since F contains [1 1], there is a genotype g starting with [2 2].
Since there are no identical columns there are only two possibilities. The first possibility is that
there is exactly one other genotype g' with a 2 in exactly one of the first two columns. Since
all genotypes different from g and g' start with [0 0], none of the resolutions of g can have
created the complete submatrix F. Contradiction. The other possibility is that there is exactly
one genotype with a 2 in the first column and exactly one genotype with a 2 in the second
column, but these are different genotypes, i.e. we have the submatrix D. Then $g \in G_2^2$ or $g \in G_2^3$
and it can again be checked that none of the resolutions in (ii) and (iv) leads to the forbidden
submatrix.

■ 
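
Case (ii) of Lemma 5 can be checked mechanically on the two columns of the submatrix with rows g = (2,2), g' = (2,0), g'' = (0,2). The sketch below assumes, as an illustration, that the forbidden submatrix F amounts to the four-gamete pattern on a pair of columns; the names `four_gametes`, `forced` and `candidates` are hypothetical.

```python
from itertools import combinations


def four_gametes(H):
    """True if some pair of columns of H shows all of 00, 01, 10, 11."""
    H = list(H)
    m = len(H[0])
    full = {(0, 0), (0, 1), (1, 0), (1, 1)}
    return any({(h[i], h[j]) for h in H} == full
               for i, j in combinations(range(m), 2))


# Haplotypes forced by g' = (2, 0) and g'' = (0, 2), restricted to these columns.
forced = {(1, 0), (0, 0), (0, 1)}

# The two possible resolutions of g = (2, 2).
candidates = {
    "g[0,0] and g[1,1]": {(0, 0), (1, 1)},
    "g[0,1] and g[1,0]": {(0, 1), (1, 0)},
}

allowed = [name for name, pair in candidates.items()
           if not four_gametes(forced | pair)]
```

Only the resolution listed in (ii) survives in `allowed`, which matches the first step of the proof.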

Lemma 6: Let G be an instance of MPPH(*, 2) and $G_1^2$, $G_2^3$ as defined above.

(i) Any nontrivial haplotype is consistent with at most two genotypes in G.

(ii) A genotype $g \in G_1^2 \cup G_2^3$ must be resolved using at least one haplotype that is not consistent
with any other genotype.

Proof: (i) Let h be a nontrivial haplotype. There is a column where h has a 1 and there
are at most two genotypes with a 2 in that column.

(ii) A genotype $g \in G_1^2 \cup G_2^3$ has a 2 in a column that has no other 2's. Hence there is a haplotype
with a 1 in this column and this haplotype is not consistent with any other genotypes.

■ 

A haplotype that is only consistent with g is called a private haplotype of g. Based on (i) and
(ii) of Lemma 5 we propose the next step of A:

Step 2 of A: Resolve all $g \in G_1^1 \cup G_2^2$ by the unique haplotypes allowed to resolve them according
to Lemma 5. Also resolve each $g \in G_0$ with $h_t$ and the complement of $h_t$ with respect to g.
This leads to a partial haplotype matrix $H^{p1}$.

The next step of A is based on Lemma 6 (ii).

Step 3 of A: For each $g \in G_1^2 \cup G_2^3$ with $g \sim_{h'} g'$ for some $h' \in H^{p1}$ that is allowed to resolve
g according to Lemma 5, resolve g by adding the complement h'' of h' w.r.t. g to the set of
haplotypes, i.e. set $H^{p1} := H^{p1} \cup \{h''\}$, and repeat this step as long as new haplotypes get added.
This leads to a partial haplotype matrix $H^{p2}$.

Notice that $H^{p2}$ does not contain any haplotype that is allowed to resolve any of the genotypes
that have not been resolved in Steps 2 and 3. Let us denote this set of leftover, unresolved
genotypes by GL, the degree 1 vertices among those by $GL_1 \subseteq G_1^2$, and the degree 2 vertices
among those by $GL_2 \subseteq G_2^3$. The restricted compatibility graph induced by GL, which we denote
by $C_R(GL)$, consists of paths and circuits. We first give the final steps of algorithm A and argue
optimality afterwards.

Step 4 of A: Resolve each cycle in $C_R(GL)$, necessarily consisting of $GL_2$-vertices, by starting
with an arbitrary vertex and, following the cycle, resolving each next pair g, g' of vertices by
the haplotype $h \neq h_t$ such that $g \sim_h g'$ and the two complements of h w.r.t. g and g' respectively.
In case of an odd cycle the last vertex is resolved by any pair of haplotypes that is allowed to
resolve it. Note that h has a 1 in the column where both g and g' have a 2 and otherwise 0. It
follows easily that g and g' are both allowed to use h (and its complement) according to (iv) of
Lemma 5.

Step 5 of A: Resolve each path in $C_R(GL)$ with both endpoints in $GL_1$ by first resolving the
$GL_1$ endpoints by the trivial haplotype $h_t$ and the complements of $h_t$ w.r.t. the two endpoint
genotypes, respectively. The remaining path contains only $GL_2$-vertices and is resolved according
to Step 6.

Step 6 of A: Resolve each remaining path by starting in (one of) its $GL_2$-endpoint(s), and
following the path, resolving each next pair of vertices as in Step 4. In case of a path with
an odd number of vertices, resolve the last vertex by any pair of haplotypes that is allowed to
resolve it in case it is a $GL_2$-vertex, and resolve it by the trivial haplotype and its complement
w.r.t. the vertex in case it is a $GL_1$-vertex.

By construction the haplotype matrix H resulting from A resolves G. In addition, it follows
from Lemma 5 that H admits a perfect phylogeny.

To argue minimality of the solution, first observe that the haplotypes added in Step 2 and Step
3 are unavoidable by Lemma 5 (i) and (ii) and Lemma 6 (ii). Lemma 6 tells us moreover that the
resolution of a cycle of k genotypes in $GL_2$ requires at least $k + \lceil k/2 \rceil$ haplotypes that cannot be
used to resolve any other genotypes in GL. This proves optimality of Step 4. To prove optimality
of the last two steps we need to take into account that genotypes in $GL_1$ can potentially share
the trivial haplotype. Observe that to resolve a path with k vertices one needs at least $k + \lceil k/2 \rceil$
haplotypes. Indeed A does not use more than that in Steps 5 and 6. Moreover, since these paths
are disjoint, they cannot share haplotypes for resolving their genotypes except for the endpoints
if they are in $GL_1$, which can share the trivial haplotype. Indeed, A exploits the possibility of
sharing the trivial haplotype in a maximal way, except on a path with an even number of vertices
and one endpoint in $GL_1$. Such a path, with k (even) vertices, is resolved in A by $\frac{3k}{2}$ haplotypes
that cannot be used to resolve any other genotypes. The degree 1 endpoint might alternatively be
resolved by the trivial haplotype and its complement w.r.t. the corresponding genotype, adding
the latter private haplotype, but then for resolving the remaining path with $k - 1$ (odd) vertices
only from $GL_2$ we still need $k - 1 + \lceil \frac{k-1}{2} \rceil$, which together with the private haplotype of the
degree 1 vertex gives $\frac{3k}{2}$ haplotypes also (not even counting $h_t$).
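
The counting in this argument can be illustrated with a short arithmetic sketch. Assuming, as in Step 4, that consecutive vertices are paired (three haplotypes per pair: one shared haplotype and two complements) and that a leftover vertex of an odd cycle costs two haplotypes, the total should equal $k + \lceil k/2 \rceil$. The function name `haplotypes_for_cycle` is hypothetical.

```python
def haplotypes_for_cycle(k):
    """Haplotypes used when a cycle of k genotypes is resolved pair by pair.

    Each pair uses 3 haplotypes; a leftover vertex (odd k) uses 2.
    """
    return 3 * (k // 2) + 2 * (k % 2)


def lower_bound(k):
    """k + ceil(k / 2), computed with integer arithmetic."""
    return k + (k + 1) // 2
```

For even k both functions give 3k/2, which is the figure used for even paths in the paragraph above.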

As a result we have polynomial-time solvability of MPPH(*,2)-C1. 

Theorem 6: MPPH(*, 2) is solvable in polynomial time if the compatibility graph is a clique. 



IV. Approximation algorithms 

In this section we construct polynomial time approximation algorithms for PH and MPPH,
where the accuracy depends on the number of 2's per column of the input matrix. We describe
genotypes without 2's as trivial genotypes, since they have to be resolved in a trivial way by one
haplotype. Genotypes with at least one 2 will be described as nontrivial genotypes. We write
$PH^{nt}$ and $MPPH^{nt}$ to denote the restricted versions of the problems where each genotype is
nontrivial. We make this distinction between the problems because we have better lower bounds
(and thus approximation ratios) for the restricted variants.

A. PH and MPPH where all input genotypes are nontrivial 

To prove approximation guarantees we need good lower bounds on the number of haplotypes 
in the solution. We start with two bounds from [17], whose proof we give because the first one 
is short but based on a crucial observation, and the second one was incomplete in [17]. We use 
these bounds to obtain a different lower bound that we need for our approximation algorithms. 

Lemma 7: [17] Let G be an $n \times m$ instance of $PH^{nt}$ (or $MPPH^{nt}$). Then at least

$$LB_{sqrt}(n) = \left\lceil \frac{1 + \sqrt{1 + 8n}}{2} \right\rceil$$

haplotypes are required to resolve G.

Proof: The proof follows directly from the observation that q haplotypes can resolve at
most $\binom{q}{2} = q(q - 1)/2$ nontrivial genotypes.

■ 
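
The bound of Lemma 7 is the smallest integer q with $q(q-1)/2 \geq n$. A minimal sketch that computes it with integer arithmetic only (avoiding floating-point square roots) is given below; the function name `lb_sqrt` is hypothetical.

```python
from math import isqrt


def lb_sqrt(n):
    """Smallest q with q(q-1)/2 >= n, i.e. ceil((1 + sqrt(1 + 8n)) / 2)."""
    q = (1 + isqrt(1 + 8 * n)) // 2  # never exceeds the true value
    while q * (q - 1) < 2 * n:
        q += 1
    return q
```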

Lemma 8: [17] Let G be an $n \times m$ instance of $PH^{nt}(*, \ell)$, for some $\ell \geq 1$, such that the
compatibility graph of G is a clique. Then at least

$$LB_{sha}(n, \ell) = \left\lceil \frac{2n}{\ell + 1} + 1 \right\rceil$$

haplotypes are required to resolve G.

Proof: Recall that, after relabeling if necessary, the trivial haplotype $h_t$ is the all-0 haplotype
and is consistent with all genotypes. Suppose a solution of G has q non-trivial haplotypes.
Observe that $h_t$ can be used in the resolution of at most q genotypes. Also observe (by Lemma
5 in [17]) that each non-trivial haplotype can be used in the resolution of at most $\ell$ genotypes.
Now distinguish two cases. First consider the case where $h_t$ is in the solution. Then from the
two observations above it follows that $n \leq (q + \ell q)/2$ and hence the solution consists of at
least $q + 1 \geq 2n/(\ell + 1) + 1$ haplotypes. Now consider the second case, i.e. where $h_t$ is not
in the solution. Then we have that $n \leq \ell q/2$ and hence that the solution consists of at least
$2n/\ell$ haplotypes. If $n \geq \ell(\ell + 1)/2$ we have that $2n/\ell \geq 2n/(\ell + 1) + 1$, and the claim follows.
If $n < \ell(\ell + 1)/2$ then this implies that

$$\ell > \frac{-1 + \sqrt{1 + 8n}}{2}.$$

Combining this with $q \geq \frac{1 + \sqrt{1 + 8n}}{2}$, which holds by Lemma 7, gives that
$(\ell + 1)(q - 1) \geq \frac{1}{4}(\sqrt{1 + 8n} + 1)(\sqrt{1 + 8n} - 1)$, which is equal to 2n.
It follows that $q \geq 2n/(\ell + 1) + 1$.

■ 

The $LB_{sha}$ bound has been proven only for $PH^{nt}$ (and $MPPH^{nt}$) instances where the
compatibility graph is a clique. We now prove a different bound which, in terms of cliques, is slightly
weaker (for large n) than $LB_{sha}$, but which allows us to generalise the bound to more general
inputs. (Indeed it remains an open question whether $LB_{sha}$ applies as a lower bound not just
for cliques but also for general instances.)



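Lemma 9: Let G be an $n \times m$ instance of $PH^{nt}(*, \ell)$ for some $\ell \geq 1$. Then at least

$$LB^{nt}_{mid}(n, \ell) = \left\lceil \frac{2(n + \ell)(\ell + 1)}{\ell(\ell + 3)} \right\rceil \qquad (1)$$

haplotypes are required to resolve G.
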
Proof: Let C(G) be the compatibility graph of G. We may assume without loss of generality
that C(G) is connected. First consider the case where C(G) is a clique. If $n \geq \ell(\ell + 1)/2$, it
suffices to notice that $LB^{nt}_{mid}(n, \ell) \leq LB_{sha}(n, \ell)$ for each value of $\ell \geq 1$, since the function

$$f(n) = \frac{2n}{\ell + 1} + 1 - \frac{2(n + \ell)(\ell + 1)}{\ell(\ell + 3)} \qquad (2)$$

is equal to 0 if $n = \ell(\ell + 1)/2$ and has nonnegative derivative
$f'(n) = \frac{2}{\ell + 1} - \frac{2(\ell + 1)}{\ell(\ell + 3)} \geq 0$.
Secondly, if $1 \leq n < \ell(\ell + 1)/2$, straightforward but tedious calculations show that for all $\ell \geq 1$
the function

$$F(n) = \frac{1 + \sqrt{1 + 8n}}{2} - \frac{2(n + \ell)(\ell + 1)}{\ell(\ell + 3)} \qquad (3)$$

has value 0 for $n = \ell(\ell + 1)/2$ and for some n in the interval [0, 1], whereas in between these
values it has positive value. Hence, $LB^{nt}_{mid}(n, \ell) \leq LB_{sqrt}(n)$ for $1 \leq n < \ell(\ell + 1)/2$.
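
The clique case of this proof compares three bounds, and the comparison can be spot-checked numerically. The sketch below assumes the closed forms $\lceil (1+\sqrt{1+8n})/2 \rceil$, $\lceil 2n/(\ell+1) \rceil + 1$ and $\lceil 2(n+\ell)(\ell+1)/(\ell(\ell+3)) \rceil$ for the three bounds and uses exact integer arithmetic; all function names are hypothetical, and a finite check of this kind illustrates the claim rather than proving it.

```python
from math import isqrt


def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def lb_sqrt(n):
    q = (1 + isqrt(1 + 8 * n)) // 2
    while q * (q - 1) < 2 * n:
        q += 1
    return q


def lb_sha(n, l):
    return ceil_div(2 * n, l + 1) + 1


def lb_mid_nt(n, l):
    return ceil_div(2 * (n + l) * (l + 1), l * (l + 3))


def clique_bound_ok(n, l):
    """Check the comparison used in the clique case for one pair (n, l)."""
    if 2 * n >= l * (l + 1):
        return lb_mid_nt(n, l) <= lb_sha(n, l)
    return lb_mid_nt(n, l) <= lb_sqrt(n)
```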

To prove that the bound also holds if C(G) is not a clique we use induction on n. Suppose
that for each $n' < n$ the lemma holds for all $n' \times m$ instances of $PH^{nt}(*, \ell')$ for every m and
$\ell'$. Since C(G) is not a clique there exist two genotypes $g_1$ and $g_2$ in G and a column j such
that $g_1(j) = 0$ and $g_2(j) = 1$. Given that G is a $PH^{nt}(*, \ell)$ instance, $t \leq \ell$ genotypes have a 2
in column j. Deleting these t genotypes yields an instance $G_d$ with disconnected compatibility
graph $C(G_d)$, since the absence of a 2 in column j prevents the existence of any path from $g_1$
to $g_2$. Let $C(G_d)$ have $p \geq 2$ components $C(G_1), \ldots, C(G_p)$, and let $n_i \geq 1$ denote the number
of genotypes in $G_i$. Thus, $n = n_1 + \ldots + n_p + t$. We use the induction hypothesis on $G_1, \ldots, G_p$
to conclude that the number of haplotypes required to resolve G is at least

$$\sum_{i=1}^{p} \left\lceil \frac{2(n_i + \ell)(\ell + 1)}{\ell(\ell + 3)} \right\rceil
\geq \frac{2\left(\sum_{i=1}^{p} n_i + p\ell\right)(\ell + 1)}{\ell(\ell + 3)}
\geq \frac{2\left(\sum_{i=1}^{p} n_i + 2\ell\right)(\ell + 1)}{\ell(\ell + 3)}
\geq \frac{2\left(\sum_{i=1}^{p} n_i + t + \ell\right)(\ell + 1)}{\ell(\ell + 3)}
= \frac{2(n + \ell)(\ell + 1)}{\ell(\ell + 3)}.$$



Corollary 1: Let G be an $n \times m$ instance of $PH^{nt}(*, \ell)$ or $MPPH^{nt}(*, \ell)$, for some $\ell \geq 1$.
Any feasible solution for G is within a ratio $\ell + 2 - \frac{2}{\ell + 1}$ from optimal.

Proof: Immediate from the fact that any solution for G has at most 2n haplotypes. In the
case of MPPH we can check whether feasible solutions exist, and if so obtain such a solution,
by using the algorithm in for example [7].

Not surprisingly, better approximation ratios can be achieved. The following simple algorithm
computes approximations of $PH^{nt}(*, \ell)$. (The algorithm does not work for MPPH, however.)

Algorithm: $PH^{nt}M$

Step 1: construct the compatibility graph C(G).
Step 2: find a maximal matching M in C(G).
Step 3: for every edge $\{g_1, g_2\} \in M$, resolve $g_1$ and $g_2$ by in total 3 haplotypes: any haplotype
consistent with both $g_1$ and $g_2$, and its complements with respect to $g_1$ and $g_2$.
Step 4: resolve each remaining genotype by two haplotypes.
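
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: genotypes are tuples over {0, 1, 2}, the maximal matching is found greedily (any maximal matching would do), and the common haplotype is chosen by putting a 0 wherever both genotypes have a 2. All function names are hypothetical.

```python
def compatible(g1, g2):
    """Two genotypes are compatible if no column has a 0 against a 1."""
    return all(a == b or a == 2 or b == 2 for a, b in zip(g1, g2))


def common_haplotype(g1, g2):
    """One haplotype consistent with both (assumed compatible) genotypes."""
    return tuple(a if a != 2 else (b if b != 2 else 0) for a, b in zip(g1, g2))


def complement(h, g):
    """The haplotype that resolves g together with h."""
    return tuple(1 - x if v == 2 else x for x, v in zip(h, g))


def resolves(h1, h2, g):
    """True if the pair (h1, h2) resolves genotype g."""
    return all((a != b) if v == 2 else (a == b == v)
               for a, b, v in zip(h1, h2, g))


def ph_nt_m(G):
    """Return (haplotype set, matching) for a list of nontrivial genotypes."""
    n = len(G)
    matched, M = set(), []
    for a in range(n):  # Steps 1-2: greedy maximal matching in C(G)
        if a in matched:
            continue
        for b in range(a + 1, n):
            if b not in matched and compatible(G[a], G[b]):
                M.append((a, b))
                matched |= {a, b}
                break
    H = set()
    for a, b in M:  # Step 3: three haplotypes per matched pair
        h = common_haplotype(G[a], G[b])
        H |= {h, complement(h, G[a]), complement(h, G[b])}
    for a in range(n):  # Step 4: two haplotypes per unmatched genotype
        if a not in matched:
            h = tuple(0 if v == 2 else v for v in G[a])
            H |= {h, complement(h, G[a])}
    return H, M
```

Because haplotypes are collected in a set, coincidental sharing can make the result smaller than the $2n - q$ counted in the analysis, where q is the size of the matching.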

Theorem 7: $PH^{nt}M$ computes a solution to $PH^{nt}(*, \ell)$ in polynomial time within an
approximation ratio of $c(\ell) = \frac{3}{4}\ell + \frac{7}{4} - \frac{3}{2}\cdot\frac{1}{\ell + 1}$, for every $\ell \geq 1$.

Proof: Since constructing C(G) given G takes $O(n^2 m)$ time and finding a maximal
matching in any graph takes linear time, $O(n^2 m)$ running time follows directly.

Let q be the size of the maximal matching. Then $PH^{nt}M$ gives a solution with $3q + 2(n - 2q)$
$= 2n - q$ haplotypes. Since the complement of the maximal matching is an independent set of
size $n - 2q$, any solution must contain at least $2(n - 2q)$ haplotypes to resolve the genotypes
in this independent set. The theorem thus holds if $\frac{2n - q}{2(n - 2q)} \leq c(\ell)$. If $\frac{2n - q}{2(n - 2q)} > c(\ell)$, implying that
$q > \frac{2c(\ell) - 2}{4c(\ell) - 1} n$, we use the lower bound of Lemma 9 to obtain

$$\frac{2n - q}{LB^{nt}_{mid}(n, \ell)} \leq \frac{2n - \frac{2c(\ell) - 2}{4c(\ell) - 1} n}{LB^{nt}_{mid}(n, \ell)} \leq \frac{\left(2n - \frac{2c(\ell) - 2}{4c(\ell) - 1} n\right)\ell(\ell + 3)}{2n(\ell + 1)} = \frac{3\ell c(\ell)}{4c(\ell) - 1} \cdot \frac{\ell + 3}{\ell + 1} = c(\ell).$$

The last equality follows directly since $(4c(\ell) - 1)(\ell + 1) = 3\ell(\ell + 3)$.
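
The closing identity can be confirmed with exact rational arithmetic. The sketch below assumes the ratio has the form $c(\ell) = \frac{3}{4}\ell + \frac{7}{4} - \frac{3}{2(\ell+1)}$, which is the value forced by the identity $(4c(\ell) - 1)(\ell + 1) = 3\ell(\ell + 3)$; the function names are hypothetical.

```python
from fractions import Fraction


def c(l):
    """Approximation ratio c(l) = 3l/4 + 7/4 - 3 / (2(l + 1)) as an exact fraction."""
    return Fraction(3, 4) * l + Fraction(7, 4) - Fraction(3, 2) / (l + 1)


def identity_holds(l):
    """Check (4c(l) - 1)(l + 1) == 3l(l + 3)."""
    return (4 * c(l) - 1) * (l + 1) == 3 * l * (l + 3)


def final_ratio(l):
    """The expression 3 l c(l) / (4c(l) - 1) * (l + 3) / (l + 1)."""
    return 3 * l * c(l) / (4 * c(l) - 1) * Fraction(l + 3, l + 1)
```

For example, `c(1)` evaluates to 7/4 and `c(2)` to 11/4.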



B. PH and MPPH where not all input genotypes are nontrivial 

Given an instance G of PH or MPPH containing n genotypes, $n_{nt}$ denotes the number of
nontrivial genotypes in G and $n_t$ the number of trivial genotypes; clearly $n = n_{nt} + n_t$.

Lemma 10: Let G be an $n \times m$ instance of PH(*, $\ell$), for some $\ell \geq 2$, where the compatibility
graph of the nontrivial genotypes in G is a clique, G is not equal to a single trivial genotype,
and no nontrivial genotype in G is the sum of two trivial genotypes in G. Then at least

$$LB_{mid}(n, \ell) = \left\lceil \frac{n}{\ell} + 1 \right\rceil$$

haplotypes are needed to resolve G.

TABLE II
Case $n_t < 4$, $n_{nt} \leq \ell$ in the proof of Lemma 10

$n_t$                     $\lceil n/\ell + 1 \rceil$
                          $\leq \lceil 2/2 + 1 \rceil = 2$
       $z \geq 2$         $\leq \lceil (z + 1)/z + 1 \rceil = 3$
                          $\leq 3$
       $z \geq 2$         $\leq \lceil (z + 2)/z + 1 \rceil = 3$
3                         $\leq 3$
3                         $\leq 3$
3      2                  $\leq 4$
3      $z \geq 3$         $\leq \lceil (z + 3)/z + 1 \rceil = 3$

Proof: Note that the lemma holds if $n_t \geq n/\ell + 1$. So we assume from now on that
$n_t < n/\ell + 1$.

We first prove that the bound holds for $n_{nt} \leq \ell$. Combining this with $n_t < n/2 + 1$ (which
follows from $\ell \geq 2$) gives that $n < 2\ell + 2$. Thus $n/\ell + 1 < 4$. Hence if $n_t \geq 4$ then we are done.
Thus we only have to consider cases where both $n_t \in \{0, 1, 2, 3\}$ and $\ell \geq \max\{2, n_{nt}\}$. We verify
these cases in Table II; note the importance of the fact that no nontrivial genotype is the sum
of two trivial genotypes in verifying that these are correct lower bounds. (Also, there is no
$n_t = 1$, $n_{nt} = 0$ case because of the lemma's precondition.)

We now prove the lemma for $n_{nt} > \ell$. Note that in this case there exists a unique trivial
haplotype $h_t$ consistent with all nontrivial genotypes. Suppose, by way of contradiction, that
$N = N_t + N_{nt}$ is the size of the smallest instance G' for which the bound does not hold. Let H
be an optimal solution for G' and let $h = |H|$.

Observe firstly that $N \equiv 1 \pmod{\ell}$, because if this is not true we have that $LB_{mid}(N - 1, \ell) =
LB_{mid}(N, \ell)$ and we can find a smaller instance for which the bound does not hold, simply by
removing an arbitrary genotype from G', contradicting the minimal choice of N.
Similarly we argue that $h = LB_{mid}(N, \ell) - 1$, since if $h \leq LB_{mid}(N, \ell) - 2$ we could remove
an arbitrary genotype to yield a size $N - 1$ instance and still have that $h < LB_{mid}(N - 1, \ell)$.

We choose a specific resolution of G' using H and represent it as a haplotype graph. The
vertices of this graph are the haplotypes in H. For each nontrivial genotype $g \in G'$ there is an
edge between the two haplotypes that resolve it. For each trivial genotype $g \in G'$ there is a loop
on the corresponding haplotype. There are no edges between looped haplotypes because of the
precondition that no nontrivial genotype is the sum of two trivial genotypes.

From Lemma 5 of [17] it follows that, with the exception of the possibly present trivial
haplotype and disregarding loops, each haplotype in the graph has degree at most $\ell$. In addition,
if an unlooped haplotype has degree less than or equal to $\ell$, or a looped haplotype has degree
(excluding its loop) strictly smaller than $\ell$, then deleting this haplotype and all its at most $\ell$
incident genotypes creates an instance G'' containing at least $N - \ell$ genotypes that can be
resolved using $h - 1$ haplotypes, yielding a contradiction to the minimality of N. (Note that,
because $N_{nt} > \ell$, it is not possible that the instance G'' is empty or equal to a single trivial
genotype.)

The only case that remains is when, apart from the possibly present trivial haplotype, every
haplotype in the haplotype graph is looped and has degree $\ell$ (excluding its loop). However,
there are no edges between looped vertices and they can therefore only be adjacent to the trivial
haplotype, yielding a contradiction.

■ 

Lemma 11: Let G be an $n \times m$ instance of PH(*, $\ell$), for some $\ell \geq 2$, where G is not equal
to a single trivial genotype, and no nontrivial genotype in G is the sum of two trivial genotypes
in G. Then at least $LB_{mid}(n, \ell)$ haplotypes are needed to resolve G.

Proof: Essentially the same inductive argument as used in Lemma 9 works: it is always
possible to disconnect the compatibility graph of G into at least two components by removing
at most $\ell$ nontrivial genotypes, and using cliques as the base of the induction. The presence
of trivial genotypes in the input (which we can actually simply exclude from the compatibility
graph) does not alter the analysis. The fact that (in the inductive step) at least two components
are created, each of which contains at least one nontrivial genotype, ensures that the inductive
argument is not harmed by the presence of single trivial genotypes (for which the bound does
not hold).

■ 

Corollary 2: Let G be an $n \times m$ instance of PH(*, $\ell$) or MPPH(*, $\ell$), for some $\ell \geq 2$. Any
feasible solution for G is within a ratio of $2\ell$ from optimal.

Proof: Immediate because $2n/(n/\ell + 1) < 2\ell$. (As before the algorithm from e.g. [7] can
be used to generate feasible solutions for MPPH, or to determine that they do not exist.)

■ 

The algorithm $PH^{nt}M$ can easily be adapted to solve PH(*, $\ell$) approximately.

Algorithm: PHM

Step 1: remove from G all genotypes that are the sum of two trivial genotypes.
Step 2: construct the compatibility graph C(G') of the leftover instance G'.
Step 3: find a maximal matching M in C(G').
Step 4: for every edge $\{g_1, g_2\} \in M$, resolve $g_1$ and $g_2$ by three haplotypes if $g_1$ and $g_2$ are
both nontrivial and by two haplotypes if one of them is trivial.
Step 5: resolve each remaining nontrivial genotype by two haplotypes and each remaining trivial
genotype by its corresponding haplotype.
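
Step 1 is the only step that differs in kind from the earlier algorithm, and it can be sketched as below. The sketch assumes genotypes are tuples over {0, 1, 2} and that the sum of two haplotypes has a 2 exactly where they differ; the function names `genotype_sum` and `remove_sums_of_trivials` are hypothetical.

```python
from itertools import combinations


def genotype_sum(h1, h2):
    """The genotype resolved by haplotypes h1 and h2 (2 where they differ)."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def remove_sums_of_trivials(G):
    """Drop every genotype that is the sum of two trivial genotypes of G."""
    trivial = [g for g in G if 2 not in g]
    sums = {genotype_sum(a, b) for a, b in combinations(trivial, 2)}
    return [g for g in G if g not in sums]
```

The removed genotypes need no extra haplotypes, since the haplotypes of the two trivial genotypes already resolve them.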

Theorem 8: PHM computes a solution to PH(*, $\ell$) in polynomial time within an
approximation ratio of $d(\ell) = \frac{3}{2}\ell + \frac{1}{2}$, for every $\ell \geq 2$.

Proof: Since constructing C(G) given G takes $O(n^2 m)$ time and finding a maximal
matching in any graph takes linear time, $O(n^2 m)$ running time follows directly.

Let q be the size of the maximal matching, n the number of genotypes after Step 1 and $n_t$ the
number of trivial genotypes in G'. Then PHM gives a solution with $2n - q - n_t$ haplotypes.
Since the complement of the maximal matching is an independent set of size $n - 2q$ in C(G'), any
solution must contain at least $n - 2q$ haplotypes to resolve the genotypes in this independent
set (trivial genotypes in the set need only one haplotype each). The theorem thus holds if
$\frac{2n - q - n_t}{n - 2q} \leq d(\ell)$. If $\frac{2n - q - n_t}{n - 2q} > d(\ell)$, implying that $q > \frac{(d(\ell) - 2)n + n_t}{2d(\ell) - 1}$,
we use the lower bound of Lemma 11 and obtain

$$\frac{2n - q - n_t}{LB_{mid}(n, \ell)} \leq \frac{2n - \frac{(d(\ell) - 2)n + n_t}{2d(\ell) - 1}}{\left\lceil \frac{n}{\ell} + 1 \right\rceil} \leq \frac{2n - \frac{(d(\ell) - 2)n}{2d(\ell) - 1}}{n/\ell} = \frac{3d(\ell)\ell}{2d(\ell) - 1} = d(\ell).$$

The last equality follows directly since $2d(\ell) - 1 = 3\ell$.

V. Postlude

There remain a number of open problems to be solved. The complexity of PH(*,2) and 
MPPH(*, 2) is still unknown. An approach that might raise the necessary insight is to study 
the PH(*, 2)-Cq and MPPH(*, 2)-Cq variants of these problems (i.e. where the compatibility 
graph is the sum of q cliques) for small q. If a complexity result nevertheless continues to be 
elusive then it would be interesting to try and improve approximation ratios for PH(*, 2) and 
MPPH(*, 2); might it even be possible to find a PTAS (Polynomial-time Approximation Scheme) 
for each of these problems? Note also that the complexity of PH(k, 2) and MPPH(k, 2) remains 
open for constant k > 3. 

Another intriguing open question concerns the relative complexity of PH and MPPH
instances. Does PH(k, $\ell$) always have the same complexity as MPPH(k, $\ell$), in terms of well-known
complexity measurements (polynomial-time solvability, NP-hardness, APX-hardness)? For hard
instances, do approximability ratios differ? A related question is whether it is possible to directly
encode PH instances as MPPH instances, and/or vice-versa, and if so whether/how this affects
the bounds on the number of 2's in columns and rows.

For hard PH(k, $\ell$) instances it would also be interesting to see if those approximation
algorithms that yield approximation ratios as functions of k can be intelligently combined with the
approximation algorithms in this paper (having approximation ratios determined by $\ell$), perhaps
with superior approximation ratios as a consequence. In terms of approximation algorithms for
MPPH there is a lot of work to be done because the approximation algorithms presented in
this paper actually do little more than return an arbitrary feasible solution. It is also not clear
if the $2^{k-1}$-approximation algorithms for PH(k, *) can be attained (or improved) for MPPH.
More generally, it seems likely that big improvements in approximation ratios (for both PH and
MPPH) will require more sophisticated, input-sensitive lower bounds and algorithms. What
are the limits of approximability for these problems, and how far will algorithms with formal
performance-guarantees (such as in this paper) have to improve to make them competitive with
dominant ILP-based methods?

Finally, with respect to MPPH, it would be worthwhile to explore how parsimonious the solutions
produced by the various PPH feasibility algorithms are, and whether searching through
the entire space of PPH solutions (as proposed in [19]) yields practical algorithms for solving
MPPH.